{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "V8k9buNKTfud"
   },
   "source": [
    "# Label and feature engineering\n",
    "\n",
    "This lab is *optional*. It demonstrates advanced SQL queries for time-series engineering. For real-world problems, this type of feature engineering code is essential. If you are pursuing a time-series project for open project week, feel free to use this code as a template. \n",
    "\n",
    "---\n",
    "\n",
    "Learning objectives:\n",
    "\n",
    "1. Learn how to use BigQuery to build time-series features and labels for forecasting\n",
    "2. Learn how to visualize and explore features.\n",
    "3. Learn effective scaling and normalizing techniques to improve our modeling results\n",
    "\n",
    "**Note: In the previous lab we explored the data, if you haven’t run the previous notebook, go back to [optional_1_data_exploration.ipynb](../solutions/optional_1_data_exploration.ipynb) and run it.**\n",
    "\n",
    "Now that we have explored the data, let's start building our features, so we can build a model.\n",
    "\n",
    "<h3><font color=\"#4885ed\">Feature Engineering</font> </h3>\n",
    "\n",
    "Use the `price_history` table, we can look at past performance of a given stock, to try to predict it's future stock price. In this notebook we will be focused on cleaning and creating features from this table. \n",
    "\n",
    "There are typically two different approaches to creating features with time-series data. \n",
    "\n",
    "**One approach** is aggregate the time-series into \"static\" features, such as \"min_price_over_past_month\" or \"exp_moving_avg_past_30_days\". Using this approach, we can use a deep neural network or a more \"traditional\" ML model to train. Notice we have essentially removed all sequention information after aggregating. This assumption can work well in practice. \n",
    "\n",
    "A **second approach** is to preserve the ordered nature of the data and use a sequential model, such as a recurrent neural network. This approach has a nice benefit that is typically requires less feature engineering. Although, training sequentially models typically takes longer.\n",
    "\n",
    "In this notebook, we will build features and also create rolling windows of the ordered time-series data.\n",
    "\n",
    "<h3><font color=\"#4885ed\">Label Engineering</font> </h3>\n",
    "\n",
    "We are trying to predict if the stock will go up or down. In order to do this we will need to \"engineer\" our label by looking into the future and using that as the label. We will be using the [`LAG`](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#lag) function in BigQuery to do this. Visually this looks like:\n",
    "\n",
    "![](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/gcp_forecasting/rolling_window.gif?raw=true)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "WBBSZf_uTdGy"
   },
   "source": [
    "## Import libraries; setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "kC9RZRlqTfuj",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "PROJECT = 'your-gcp-project'  # Replace with your project ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "IjsuN9heTfue",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from google.cloud import bigquery\n",
    "from IPython.core.magic import register_cell_magic\n",
    "from IPython import get_ipython\n",
    "\n",
    "bq = bigquery.Client(project = PROJECT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "xyaeBdzMTdG2",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Allow you to easily have Python variables in SQL query.\n",
    "@register_cell_magic('with_globals')\n",
    "def with_globals(line, cell):\n",
    "    contents = cell.format(**globals())\n",
    "    if 'print' in line:\n",
    "        print(contents)\n",
    "    get_ipython().run_cell(contents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "VHEy7L2EW-ug",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "b8a94157-c6d9-425d-b124-cead5d7d9d3f"
   },
   "outputs": [],
   "source": [
    "def create_dataset():\n",
    "    dataset = bigquery.Dataset(bq.dataset(\"stock_market\"))\n",
    "    try:\n",
    "        bq.create_dataset(dataset)  # Will fail if dataset already exists.\n",
    "        print(\"Dataset created\")\n",
    "    except:\n",
    "        print(\"Dataset already exists\")\n",
    "\n",
    "create_dataset()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Ip9SZU7CTful"
   },
   "source": [
    "## Create time-series features and determine label based on market movement"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "WduqaabdTfum"
   },
   "source": [
    "### Summary of base tables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**TODO**: How many rows are in our base tables `price_history` and `snp500`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 77
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "GEmgSKBNTdG_",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "845d17a4-83a0-4314-d888-8cc29ee90c2b"
   },
   "outputs": [],
   "source": [
    "%%with_globals\n",
    "%%bigquery --project {PROJECT}\n",
    "--# TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 77
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "Utn87x_ATdHC",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "9d4a74d7-1ad0-4510-adf0-91c7ee7e87b5"
   },
   "outputs": [],
   "source": [
    "%%with_globals\n",
    "%%bigquery --project {PROJECT}\n",
    "--# TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "at7EL7pITfuq"
   },
   "source": [
    "### Label engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "pQ7R1VcWeJq9"
   },
   "source": [
    "Ultimately, we need to end up with a single label for each day. The label takes on 3 values: {`down`, `stay`, `up`}, where `down` and `up` indicates the normalized price (more on this below) went down 1% or more and up 1% or more, respectively. `stay` indicates the stock remained within 1%.\n",
    "\n",
    "The steps are:\n",
    "\n",
    "1. Compare close price and open price\n",
    "2. Compute price features using analytics functions\n",
    "3. Compute normalized price change (%)\n",
    "4. Join with S&P 500 table\n",
    "5. Create labels (`up`, `down`, `stay`)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1FvPgI6UOoQO"
   },
   "source": [
    "<h3><font color=\"#4885ed\">Compare close price and open price</font> </h3>\n",
    "\n",
    "For each row, get the close price of yesterday and the open price of tomorrow using the [`LAG`](https://cloud.google.com/bigquery/docs/reference/legacy-sql#lag) function. We will determine tomorrow's close - today's close."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vBJzyVtCTfur"
   },
   "source": [
    "#### Shift to get tomorrow's close price."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Learning objective 1**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "IreuNo_pTfus",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery df --project {PROJECT}\n",
    "CREATE OR REPLACE TABLE `stock_market.price_history_delta`\n",
    "AS\n",
    "(\n",
    "WITH shifted_price AS\n",
    "  (\n",
    "  SELECT *,\n",
    "  (LAG(close, 1) OVER (PARTITION BY symbol order by Date DESC)) AS tomorrow_close\n",
    "  FROM `stock_src.price_history`\n",
    "  WHERE Close > 0\n",
    "  )\n",
    "SELECT a.*,\n",
    "(tomorrow_close - Close) AS tomo_close_m_close\n",
    "FROM shifted_price a\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "adrk4vc1TdHM",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "6d6fbba1-9b94-4b96-fd49-6ceb39fa4ccb"
   },
   "outputs": [],
   "source": [
    "%%with_globals\n",
    "%%bigquery --project {PROJECT}\n",
    "SELECT *\n",
    "FROM stock_market.price_history_delta\n",
    "ORDER by Date\n",
    "LIMIT 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8UnOKtvdTdHO"
   },
   "source": [
    "**TODO**: Historically, we know that the stock market has been going up. Can you think of a way to verify this using our newly created table `price_history_delta`?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Learning objective 2**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 77
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "XY2MetOeTfux",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "9310bac5-2d0e-468b-fa8e-a981f09b4b1c"
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "SELECT\n",
    "--# TODO: verify the stock market is going up -- on average.\n",
    "FROM\n",
    "stock_market.price_history_delta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "efb9PCBdTfu0"
   },
   "source": [
    "### Add time series features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "S_vRjdyhOqZi"
   },
   "source": [
    "<h3><font color=\"#4885ed\">Compute price features using analytics functions</font> </h3>\n",
    "\n",
    "In addition, we will also build time-series features using the min, max, mean, and std (can you think of any over functions to use?). To do this, let's use [analytic functions]() in BigQuery (also known as window functions). \n",
    "```\n",
    "An analytic function is a function that computes aggregate values over a group of rows. Unlike aggregate functions, which return a single aggregate value for a group of rows, analytic functions return a single value for each row by computing the function over a group of input rows.\n",
    "```\n",
    " Using the `AVG` analytic function, we can compute the average close price of a given symbol over the past week (5 business days):\n",
    "```python\n",
    " (AVG(close) OVER (PARTITION BY symbol\n",
    "  ORDER BY Date\n",
    "  ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING)) / close\n",
    "  AS close_avg_prior_5_days\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Learning objective 1**\n",
    "\n",
    "**TODO**: Please fill in the `# TODO`s in the below query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 840
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "pBi_CruzTfu0",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "7a7c57f8-cb57-4c81-9786-c785c9c4c518"
   },
   "outputs": [],
   "source": [
    "def get_window_fxn(agg_fxn, n_days):\n",
    "    \"\"\"Generate a time-series feature. \n",
    "    \n",
    "    E.g., Compute the average of the price over the past 5 days.\"\"\"\n",
    "    SCALE_VALUE = 'close'\n",
    "    sql = '''\n",
    "    ({agg_fxn}(close) OVER (PARTITION BY (# TODO)\n",
    "                      ORDER BY (# TODO)\n",
    "                      ROWS BETWEEN {n_days} (# TODO)))/{scale}\n",
    "                      AS close_{agg_fxn}_prior_{n_days}_days'''.format(\n",
    "                          agg_fxn=agg_fxn, n_days=n_days, scale=SCALE_VALUE)\n",
    "    return sql\n",
    "\n",
    "\n",
    "WEEK = 5\n",
    "MONTH = 20\n",
    "YEAR = 52*5\n",
    "\n",
    "agg_funcs = ('MIN', 'MAX', 'AVG', 'STDDEV')\n",
    "lookbacks = (WEEK, MONTH, YEAR)\n",
    "sqls = []\n",
    "\n",
    "for fxn in agg_funcs:\n",
    "    for lookback in lookbacks:\n",
    "        sqls.append(get_window_fxn(fxn, lookback))\n",
    "time_series_features_sql = ','.join(sqls)  # SQL string.\n",
    "\n",
    "def preview_query():\n",
    "    print(time_series_features_sql[0:1000])\n",
    "preview_query()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 31
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "4WX4VFSvTfu2",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "41cdcb4d-ccbd-4e12-9c6c-88de2d4538f5"
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "CREATE OR REPLACE TABLE stock_market.price_features_delta\n",
    "AS\n",
    "SELECT * \n",
    "FROM \n",
    "    (SELECT *,\n",
    "    {time_series_features_sql},\n",
    "    -- Also get the raw time-series values; will be useful for the RNN model.\n",
    "    (ARRAY_AGG(close) OVER (PARTITION BY symbol\n",
    "                    ORDER BY Date\n",
    "                    ROWS BETWEEN 260 PRECEDING AND 1 PRECEDING))\n",
    "                    AS close_values_prior_260,\n",
    "    ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY Date) AS days_on_market\n",
    "    FROM stock_market.price_history_delta)\n",
    "WHERE days_on_market > {YEAR}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%bigquery --project {PROJECT}\n",
    "SELECT *\n",
    "FROM stock_market.price_features_delta \n",
    "ORDER BY symbol, Date\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "EjGaQYuRTfu6"
   },
   "source": [
    "#### Compute percentage change, then self join with prices from S&P index. \n",
    "\n",
    "We will also compute price change of S&P index, GSPC. We do this so we can compute the normalized percentage change."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zL55y-YnOvOu"
   },
   "source": [
    "<h3><font color=\"#4885ed\">Compute normalized price change (%)</font> </h3>\n",
    "\n",
    "Before we can create our labels we need to normalize the price change using the S&P 500 index. The normalization using the S&P index fund helps ensure that the future price of a stock is not due to larger market effects. Normalization helps us isolate the factors contributing to the performance of a stock_market.\n",
    "\n",
    " Let's use the normalization scheme from by subtracting the scaled difference in the S&P 500 index during the same time period.\n",
    " \n",
    " In Python: \n",
    "```python\n",
    "# Example calculation.\n",
    "scaled_change = (50.59 - 50.69) / 50.69\n",
    "scaled_s_p = (939.38 - 930.09) / 930.09\n",
    "normalized_change = scaled_change - scaled_s_p\n",
    "assert normalized_change == ~1.2%\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "0uz1Qo0STfu7",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "67e1be37-8729-47d5-fde9-9c4f87da82b5"
   },
   "outputs": [],
   "source": [
    "scaled_change = (50.59 - 50.69) / 50.69\n",
    "scaled_s_p = (939.38 - 930.09) / 930.09\n",
    "normalized_change = scaled_change - scaled_s_p\n",
    "print('''\n",
    "scaled change: {:2.3f}\n",
    "scaled_s_p: {:2.3f}\n",
    "normalized_change: {:2.3f}\n",
    "'''.format(scaled_change, scaled_s_p, normalized_change))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HY9AJAN3Tfu-"
   },
   "source": [
    "### Compute normalized price change (shown above)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "m6OhYVoITdHd"
   },
   "source": [
    "Let's join scaled price change (tomorrow_close / close) with the [gspc](https://en.wikipedia.org/wiki/S%26P_500_Index) symbol (symbol for the S&P index). Then we can normalize using the scheme described above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Learning objective 3**\n",
    "\n",
    "**TODO**: Please fill in the `# TODO` in the code below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "_W71_cb4TdHe",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "snp500_index = 'gspc'  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 31
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "b1PNvxhuTfu_",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "c9b87b11-cff2-452f-8c29-147290f95e1f"
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "CREATE OR REPLACE TABLE stock_market.price_features_norm_per_change\n",
    "AS\n",
    "WITH \n",
    "all_percent_changes AS\n",
    "    (\n",
    "    SELECT *, (tomo_close_m_close / Close) AS scaled_change\n",
    "    FROM `stock_market.price_features_delta`\n",
    "    ),\n",
    "s_p_changes AS\n",
    "    (SELECT\n",
    "     scaled_change AS s_p_scaled_change,\n",
    "     date\n",
    "     FROM all_percent_changes\n",
    "     WHERE symbol=\"{snp500_index}\")\n",
    "\n",
    "SELECT all_percent_changes.*,\n",
    "       s_p_scaled_change,\n",
    "       (# TODO) AS normalized_change          \n",
    "FROM \n",
    "    all_percent_changes LEFT JOIN s_p_changes\n",
    "    --# Add S&P change to all rows\n",
    "    ON all_percent_changes.date = s_p_changes.date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5lcs6_BtTfvB"
   },
   "source": [
    "#### Verify results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "0G1SbI8kTdHl",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery df --project {PROJECT}\n",
    "SELECT *\n",
    "FROM stock_market.price_features_norm_per_change\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 299
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "BeNiVymgTdHn",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "f7534321-713c-483d-ba4f-d096c59296fa"
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8TFFeA5sOm2Y"
   },
   "source": [
    "<h3><font color=\"#4885ed\">Join with S&P 500 table and Create labels: {`up`, `down`, `stay`}</font> </h3>\n",
    "\n",
    "Join the table with the list of S&P 500. This will allow us to limit our analysis to S&P 500 companies only.\n",
    "\n",
    "Finally we can create labels. The following SQL statement should do:\n",
    "\n",
    "```sql\n",
    "  CASE WHEN normalized_change < -0.01 THEN 'DOWN'\n",
    "       WHEN normalized_change > 0.01 THEN 'UP'\n",
    "       ELSE 'STAY'\n",
    "  END\n",
    " ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Learning objective 1**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "iv8i3e8GTdHq",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "down_thresh = -0.01\n",
    "up_thresh = 0.01"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**TODO**: Please fill in the `CASE` function below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "-Kf5POU6TfvM",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery df --project {PROJECT}\n",
    "CREATE OR REPLACE TABLE stock_market.percent_change_sp500\n",
    "AS\n",
    "SELECT *,\n",
    "    CASE \n",
    "    --# TODO\n",
    "    END AS direction\n",
    "FROM stock_market.price_features_norm_per_change features\n",
    "INNER JOIN `stock_src.snp500`\n",
    "USING (symbol)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 136
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "jQzSbN2yTdH0",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "7a58fcb5-9a31-4c1b-fddb-ef3a5b379f2e"
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "SELECT direction, COUNT(*) as cnt\n",
    "FROM stock_market.percent_change_sp500\n",
    "GROUP BY direction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "OLYTEUstTfva",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery df --project {PROJECT}\n",
    "SELECT *\n",
    "FROM stock_market.percent_change_sp500\n",
    "LIMIT 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 202
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "iiARkRzPTdH5",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "2ed8fc43-f8c7-4827-9033-12b142917c9a"
   },
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is still quite large and the majority of the days the market `STAY`s. Let's focus our analysis on dates where [earnings per share](https://en.wikipedia.org/wiki/Earnings_per_share) (EPS) information is released by the companies. The EPS data has 3 key columns surprise, reported_EPS, and consensus_EPS: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "SELECT *\n",
    "FROM `stock_src.eps`\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The surprise column indicates the difference between the expected (consensus expected eps by analysts) and the reported eps. We can join this table with our derived table to focus our analysis during earnings periods:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "CREATE OR REPLACE TABLE stock_market.eps_percent_change_sp500\n",
    "AS\n",
    "SELECT a.*, b.consensus_EPS, b.reported_EPS, b.surprise\n",
    "FROM stock_market.percent_change_sp500 a\n",
    "INNER JOIN `stock_src.eps` b\n",
    "ON a.Date = b.date\n",
    "AND a.symbol = b.symbol"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "OLYTEUstTfva",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "SELECT *\n",
    "FROM stock_market.eps_percent_change_sp500\n",
    "LIMIT 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "%%with_globals print\n",
    "%%bigquery --project {PROJECT}\n",
    "SELECT direction, COUNT(*) as cnt\n",
    "FROM stock_market.eps_percent_change_sp500\n",
    "GROUP BY direction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "COPWKR1WTfvd"
   },
   "source": [
    "## Feature exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "T5HLcwy1Tfve"
   },
   "source": [
    "Now that we have created our recent movements of the company’s stock price, let's visualize our features. This will help us understand the data better and possibly spot errors we may have made during our calculations.\n",
    "\n",
    "As a reminder, we calculated the scaled prices 1 week, 1 month, and 1 year before the date that we are predicting at. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "RDROJ7qMh7oz"
   },
   "source": [
    "Let's write a re-usable function for aggregating our features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Learning objective 2**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Q7dT9NTSTfvf"
   },
   "outputs": [],
   "source": [
    "def get_aggregate_stats(field, round_digit=2):\n",
    "    \"\"\"Run SELECT ... GROUP BY field, rounding to nearest digit.\"\"\"\n",
    "    df = bq.query('''\n",
    "    SELECT {field}, COUNT(*) as cnt\n",
    "    FROM\n",
    "    (SELECT ROUND({field}, {round_digit}) AS {field}\n",
    "    FROM stock_market.eps_percent_change_sp500) rounded_field\n",
    "    GROUP BY {field}\n",
    "    ORDER BY {field}'''.format(field=field,\n",
    "                             round_digit=round_digit,\n",
    "                             PROJECT=PROJECT)).to_dataframe()\n",
    "    return df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 338
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "xgmCvlMtTfvh",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "5db92f71-8cce-4475-fad8-a991e2fc208f"
   },
   "outputs": [],
   "source": [
    "field = 'close_AVG_prior_260_days'\n",
    "CLIP_MIN, CLIP_MAX = 0.1, 4.\n",
    "df = get_aggregate_stats(field)\n",
    "values = df[field].clip(CLIP_MIN, CLIP_MAX)\n",
    "counts = 100 * df['cnt'] / df['cnt'].sum()  # Percentage.\n",
    "ax = values.hist(weights=counts, bins=30, figsize=(10, 5))\n",
    "ax.set(xlabel=field, ylabel=\"%\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**TODO** Use the `get_aggregate_stats` from above to visualize the `normalized_change` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "UDcnYJrCTfvj",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "7e949d16-8c9a-416c-de2d-28ba97a2aa65"
   },
   "outputs": [],
   "source": [
    "field = 'normalized_change'\n",
    "# TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DuV7glaEh7o_"
   },
   "source": [
    "Let's look at results by day-of-week, month, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "collapsed": true,
    "id": "l7egsYhcTfvm",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "VALID_GROUPBY_KEYS = ('DAYOFWEEK', 'DAY', 'DAYOFYEAR',\n",
    "                   'WEEK', 'MONTH', 'QUARTER', 'YEAR')\n",
    "DOW_MAPPING = {1: 'Sun', 2: 'Mon', 3: 'Tues', 4: 'Wed',\n",
    "                5: 'Thur', 6: 'Fri', 7: 'Sun'}\n",
    "\n",
    "def groupby_datetime(groupby_key, field):\n",
    "    if groupby_key not in VALID_GROUPBY_KEYS:\n",
    "        raise Exception('Please use a valid groupby_key.')\n",
    "    sql = '''\n",
    "    SELECT {groupby_key}, AVG({field}) as avg_{field}\n",
    "    FROM\n",
    "    (SELECT {field},\n",
    "     EXTRACT({groupby_key} FROM date) AS {groupby_key}\n",
    "    FROM stock_market.eps_percent_change_sp500) foo\n",
    "    GROUP BY {groupby_key}\n",
    "    ORDER BY {groupby_key} DESC'''.format(groupby_key=groupby_key,\n",
    "                                      field=field,\n",
    "                                      PROJECT=PROJECT)\n",
    "    print(sql)\n",
    "    df = bq.query(sql).to_dataframe()\n",
    "    if groupby_key == 'DAYOFWEEK':\n",
    "        df.DAYOFWEEK = df.DAYOFWEEK.map(DOW_MAPPING)\n",
    "    return df.set_index(groupby_key).dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 403
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "z7mxvIqYTfvp",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "df531585-b724-4907-a84a-03483efc9a7d"
   },
   "outputs": [],
   "source": [
    "field = 'normalized_change'\n",
    "\n",
    "df = groupby_datetime('DAYOFWEEK', field)\n",
    "ax = df.plot(kind='barh', color='orange', alpha=0.7)\n",
    "ax.grid(which='major', axis='y', linewidth=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 403
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "BRI70WJpTfvs",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "b7525484-9b43-407b-f180-7de5ab7225a4"
   },
   "outputs": [],
   "source": [
    "field = 'close'\n",
    "df = groupby_datetime('DAYOFWEEK', field)\n",
    "ax = df.plot(kind='barh', color='orange', alpha=0.7)\n",
    "ax.grid(which='major', axis='y', linewidth=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 403
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "40m-6nMKTfvw",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "4c4bd05b-2278-4eb7-a741-d39076ec59d3"
   },
   "outputs": [],
   "source": [
    "field = 'normalized_change'\n",
    "df = groupby_datetime('MONTH', field)\n",
    "ax = df.plot(kind='barh', color='blue', alpha=0.7)\n",
    "ax.grid(which='major', axis='y', linewidth=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 403
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "Zj2pOrAiTfvz",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "4afba896-7e82-458a-86b0-4562dd31b5a4"
   },
   "outputs": [],
   "source": [
    "field = 'normalized_change'\n",
    "df = groupby_datetime('QUARTER', field)\n",
    "ax = df.plot(kind='barh', color='green', alpha=0.7)\n",
    "ax.grid(which='major', axis='y', linewidth=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 418
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "ohYWA_YsTfv4",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "7b597812-82f6-41dd-98c4-f3ed1ec5dc6c"
   },
   "outputs": [],
   "source": [
    "field = 'close'\n",
    "df = groupby_datetime('YEAR', field)\n",
    "ax = df.plot(kind='line', color='purple', alpha=0.7)\n",
    "ax.grid(which='major', axis='y', linewidth=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 418
    },
    "colab_type": "code",
    "collapsed": true,
    "id": "BBTC2VunTfv2",
    "jupyter": {
     "outputs_hidden": true
    },
    "outputId": "3b21cb7d-4f80-4faf-a2c9-e67191969556"
   },
   "outputs": [],
   "source": [
    "field = 'normalized_change'\n",
    "df = groupby_datetime('YEAR', field)\n",
    "ax = df.plot(kind='line', color='purple', alpha=0.7)\n",
    "ax.grid(which='major', axis='y', linewidth=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "P7UJ0W-5WvZC"
   },
   "source": [
    "BONUS: How do our features correlate with the label `direction`? Build some visualizations. What features are most important? You can visualize this and do it statistically using the [`CORR`](https://cloud.google.com/bigquery/docs/reference/standard-sql/statistical_aggregate_functions) function."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oRCY1E6CTfxD"
   },
   "source": [
    "Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "2_feature_engineering.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "environment": {
   "name": "tf2-gpu.2-1.m68",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m68"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
